When attempting to understand customers from their observed behavior, recorded as events, it is typically useful to sessionize the event stream so that each session corresponds to a single customer engagement. These individual sessions can then be aggregated to give a high-level characterization of each customer.
I have been talking about these issues for a few years. See my archives here and here. Most recently, I presented Structuring Data for Self-serve Customer Insights at ODSC West in November 2015.
For this example, we use the infamous 3.5-million-row AOL search data set, which was used in a recent Stanford class. See also the AOL Readme.
The goal of this exercise is to roll up the individual search events into session-level records, and then aggregate those sessions into visitor-level summaries.
```r
library(readr)
library(dplyr)
library(ggplot2)
library(knitr)
library(dplyrExamples)

fn <- "http://sing.stanford.edu/cs303-sp10/assignments/user-ct-test-collection-01.zip"
download.file(fn, "aolzip")
aol.tsv <- unzip("aolzip")

t0 <- Sys.time()
# fn <- system.file("extdata", "user-ct-test-collection-01.zip", package = "dplyrExamples")
aol <- read_tsv(aol.tsv)
(Elapsed <- Sys.time() - t0)
glimpse(aol)
```
The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
Each line in the data represents one of two types of events: a query only, or a click through on a search result.
In the first case (query only) there is data in only the first three columns/fields -- namely AnonID, Query, and QueryTime (see above).
In the second case (click through), there is data in all five columns. For click-through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" of results for some query, this appears as a subsequent identical query with a later time stamp.
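As a quick sanity check on that multiple-clicks behavior, a sketch like the following (assuming `aol` has been loaded as above) counts how often a single query instance produced more than one click-through line:

```r
library(dplyr)

# Rows per (visitor, timestamp, query) among click-through events;
# a count > 1 means one query whose results were clicked more than once.
multi_clicks <- aol %>%
  filter(!is.na(ClickURL)) %>%
  count(AnonID, QueryTime, Query) %>%
  filter(n > 1)

nrow(multi_clicks)  # number of query instances with multiple clicks
```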
The big idea is to model a visitor's search session. In other words, a visitor comes to the site, does one or more searches (with possible click-throughs), and then leaves the site. The goal is to summarize that activity in one record. Industry convention is that a session ends if there is more than a 30-minute gap before the visitor's next event. We will use that convention here.
For the purpose of this example, we are going to ignore data issues around multiple clicks per search and next page records.
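To make the 30-minute rule concrete before running it on the full data, here is a minimal sketch on a hand-made event stream (the IDs and timestamps are invented for illustration; the logic is the same as in the production pipeline below):

```r
library(dplyr)

# Visitor 1 searches at minutes 0, 5, 50, 55; visitor 2 at minute 0.
toy <- data.frame(
  AnonID    = c(1, 1, 1, 1, 2),
  QueryTime = as.POSIXct("2006-03-01 00:00:00", tz = "UTC") +
              60 * c(0, 5, 50, 55, 0)
)

toy %>%
  group_by(AnonID) %>%
  mutate(
    Minutes_After_Last = difftime(QueryTime, lag(QueryTime), units = "mins"),
    # A new session starts on a visitor's first event or after a > 30 min gap
    New_Session_Flag   = is.na(lag(AnonID)) | Minutes_After_Last > 30,
    Session_Seq_Num    = cumsum(New_Session_Flag)
  )
```

Visitor 1's third event arrives 45 minutes after the second, so it opens session 2; visitor 2's lone event starts that visitor's session 1.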
This is the dplyr sequence you would use in production. The next section breaks the process down step by step.
```r
t0 <- Sys.time()
aol_sessions <- aol %>%
  arrange(AnonID, QueryTime) %>%
  group_by(AnonID) %>%
  mutate(
    Minutes_After_Last = difftime(QueryTime, lag(QueryTime), units = "mins"),
    New_Session_Flag = is.na(lag(AnonID)) | Minutes_After_Last > 30,
    Session_Seq_Num = cumsum(New_Session_Flag)
  ) %>%
  group_by(AnonID, Session_Seq_Num) %>%
  summarize(
    Session_Start_At = first(QueryTime),
    Number_Searches = n(),
    Number_Terms = n_distinct(Query),
    Session_Duration_Minutes = as.numeric(difftime(last(QueryTime), first(QueryTime), units = "mins")),
    Number_Clicks = sum(!is.na(ClickURL))
  )
(Elapsed <- Sys.time() - t0)
glimpse(aol_sessions)
```
Breaking down the above block of dplyr code...
It turns out the raw data file is already in this sort order, so the only visible change is that the grouping variable AnonID is set.
```r
glimpse(aol)
groups(aol)

aols <- aol %>%
  arrange(AnonID, QueryTime) %>%
  group_by(AnonID)

glimpse(aols)
groups(aols)
```
The session sequence number starts at 1 for each visitor and is incremented whenever the time interval from the last record is greater than 30 minutes. Figuring this out is done in three steps:
```r
aols <- aols %>%
  mutate(
    Minutes_After_Last = difftime(QueryTime, lag(QueryTime), units = "mins"),
    New_Session_Flag = is.na(lag(AnonID)) | Minutes_After_Last > 30,
    Session_Seq_Num = cumsum(New_Session_Flag)
  )
glimpse(aols)

kable(aols[14:30, -c(2, 4, 5)], caption = "Look at some interesting rows:")
```
This final step is straightforward.
```r
aols <- aols %>%
  group_by(AnonID, Session_Seq_Num) %>%
  summarize(
    Session_Start_At = first(QueryTime),
    Number_Searches = n(),
    Number_Terms = n_distinct(Query),
    Session_Duration_Minutes = as.numeric(difftime(last(QueryTime), first(QueryTime), units = "mins")),
    Number_Clicks = sum(!is.na(ClickURL))
  )
glimpse(aols)
```
```r
ggplot(aol_sessions, aes(Session_Duration_Minutes)) +
  geom_histogram(binwidth = 10) +
  ggtitle("AOL Sessions - Distribution of Session Durations") +
  scale_y_log10()

ggplot(aol_sessions, aes(Number_Clicks)) +
  geom_histogram(binwidth = 1) +
  ggtitle("AOL Sessions - Distribution of # Clicks in Session") +
  scale_y_log10() +
  xlim(1, 50)

ggplot(aol_sessions, aes(as.Date(Session_Start_At))) +
  geom_histogram(binwidth = 1) +
  ggtitle("AOL Sessions - Distribution of Session Start Dates")
```
This step is trivial and fast. We just group_by() the visitor's unique identifier and then use summarize() to create the visitor metrics we are interested in.
```r
t0 <- Sys.time()
aol_visitors <- aol_sessions %>%
  group_by(AnonID) %>%
  summarize(
    Number_Sessions = n(),
    First_Session_At = min(Session_Start_At),
    Last_Session_At = max(Session_Start_At),
    Total_Duration_Minutes = as.numeric(sum(Session_Duration_Minutes)),
    Avg_Duration_Minutes = as.numeric(mean(Session_Duration_Minutes)),
    Median_Duration_Minutes = as.numeric(median(Session_Duration_Minutes)),
    Avg_Num_Searches = mean(Number_Searches),
    Median_Num_Searches = median(Number_Searches),
    Avg_Num_Clicks = mean(Number_Clicks),
    Median_Num_Clicks = median(Number_Clicks)
  )
(Elapsed <- Sys.time() - t0)
glimpse(aol_visitors)
```
Basic metrics for the AOL Search data set:
- Number of visitors: `n_distinct(aol$AnonID)`
- Number of sessions: `nrow(aol_sessions)`
- Number of search events: `nrow(aol)`
- First event at: `min(aol$QueryTime)`
- Last event at: `max(aol$QueryTime)`
```r
ggplot(aol_visitors, aes(Number_Sessions)) +
  geom_histogram(binwidth = 1) +
  ggtitle("AOL Visitors - Distribution of Number of Sessions") +
  scale_y_log10() +
  xlim(1, 200)

ggplot(aol_visitors, aes(Avg_Duration_Minutes)) +
  geom_histogram(binwidth = 1) +
  ggtitle("AOL Visitors - Distribution of Average Session Duration") +
  scale_y_log10() +
  xlim(0, 60)

ggplot(aol_visitors, aes(Avg_Num_Clicks)) +
  geom_histogram(binwidth = 1) +
  ggtitle("AOL Visitors - Distribution of Average Number of Click Throughs") +
  scale_y_log10() +
  xlim(0, 20)
```
The place to start, of course, is Hadley's vignettes in the dplyr package. Especially Introduction to dplyr and Window functions and grouped mutate/filter.
Now that Hadley is with RStudio, search their blog for dplyr; get the Data Wrangling Cheat Sheet; watch Data Wrangling with R & RStudio. To understand Hadley's current thinking about data analysis watch Pipelines for Data Analysis in R and The Grammar and Graphics of Data Science - the latter with Winston Chang.
Lastly, see Garrett & Hadley's chapter on data transformation in their upcoming R for Data Science.
To compare the dplyr windowing method with how it works in SQL see this simple example or Google '"partition by" sql', perhaps replacing 'sql' with your favorite DBMS; e.g. 'postgresql', 'redshift', etc.
This example did a full refresh of the aol_sessions and aol_visitors tables. In a production environment where new data come in, say, nightly, we would sessionize only the new records and append them to the existing records, if any, for the visitors in the nightly set.
(The brute-force method is to let sessions end at the end of each batch load. Generally this is acceptable. You can check this on the fully processed sessions by seeing how many actually span the cutoff time, e.g., midnight.)
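One way to run that check (assuming `aol_sessions` as computed above, with durations in minutes) is to count sessions whose start and end fall on different calendar days:

```r
library(dplyr)

# A session "spans midnight" when it ends on a later date than it started.
spanning <- aol_sessions %>%
  ungroup() %>%
  mutate(End_At = Session_Start_At + 60 * Session_Duration_Minutes) %>%
  filter(as.Date(Session_Start_At) != as.Date(End_At))

nrow(spanning) / nrow(aol_sessions)  # fraction crossing a day boundary
```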
Then update just the visitor-level summaries for the visitor IDs that were in the incremental batch.
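A minimal sketch of that incremental flow might look like the following. Note that `new_events` and `stored_events` are hypothetical names for the nightly batch and the raw-event store; a real pipeline would also persist the results rather than rebuild them in memory:

```r
library(dplyr)

# Hypothetical nightly batch: new_events has the same columns as aol.
# Re-sessionize only the visitors present in tonight's batch, pulling in
# their previously stored raw events so gaps across batches are honored.
affected_ids <- unique(new_events$AnonID)

events_to_process <- bind_rows(
  stored_events %>% filter(AnonID %in% affected_ids),  # hypothetical store
  new_events
)

refreshed_sessions <- events_to_process %>%
  arrange(AnonID, QueryTime) %>%
  group_by(AnonID) %>%
  mutate(
    Minutes_After_Last = difftime(QueryTime, lag(QueryTime), units = "mins"),
    New_Session_Flag   = is.na(lag(AnonID)) | Minutes_After_Last > 30,
    Session_Seq_Num    = cumsum(New_Session_Flag)
  ) %>%
  group_by(AnonID, Session_Seq_Num) %>%
  summarize(
    Session_Start_At = first(QueryTime),
    Number_Searches  = n(),
    Number_Clicks    = sum(!is.na(ClickURL))
  )

# Replace the affected visitors' rows in the stored session table;
# the visitor-level rollup for those IDs would then be recomputed the same way.
aol_sessions <- aol_sessions %>%
  filter(!AnonID %in% affected_ids) %>%
  bind_rows(refreshed_sessions)
```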
I hope this dplyr example inspired you to add the library to your regular tool set.
Please send comments and suggestions to Jim at DS4CI.org or leave an issue or pull request at my github.
Thanks! Jim
```r
file.remove("user-ct-test-collection-01.txt")
file.remove("aolzip")
```